6 research outputs found

    A Non-Deterministic Strategy for Searching Optimal Number of Trees Hyperparameter in Random Forest

    Get PDF
    In this paper, we present a non-deterministic strategy for searching for the optimal number-of-trees hyperparameter in Random Forest (RF). Hyperparameter tuning is essential in Machine Learning (ML): it optimizes the predictive performance of an ML algorithm and/or improves the use of computing resources. However, hyperparameter tuning is a complex and time-consuming optimization task. We set up experiments with the goals of maximizing predictive accuracy, minimizing the number of trees, and minimizing execution time. Compared to the deterministic search algorithm, the non-deterministic search algorithm recorded an average accuracy of approximately 98%, an average improvement of 44.64% in the number of trees, an average execution-time improvement ratio of 213.25, and a 94% average reduction in iterations. Moreover, evaluations using Jackknife Estimation show stable and reliable results across several runs of the non-deterministic strategy. The non-deterministic approach to hyperparameter search achieves significant accuracy with better utilization of computing resources (i.e., CPU time and memory). This approach can be adopted widely in hyperparameter tuning and helps conserve computing resources, in the spirit of green computing.
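The abstract does not give the paper's exact algorithm; as a minimal sketch of the general idea, the snippet below contrasts a non-deterministic (random) search over the number-of-trees hyperparameter with the exhaustive sweep it replaces, using scikit-learn. The dataset, candidate range, and sampling budget are all assumptions for illustration.

```python
# Sketch: random search over n_estimators vs. an exhaustive sweep.
# Illustrative only -- not the paper's exact algorithm.
import random
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def score(n_trees):
    """Mean cross-validated accuracy for a forest of n_trees trees."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

candidates = range(10, 201, 10)            # exhaustive search would try all 20
rng = random.Random(0)
sampled = rng.sample(list(candidates), 5)  # random search tries only 5

best_n = max(sampled, key=score)           # pick the sampled value that scores best
print(best_n, round(score(best_n), 3))
```

Evaluating 5 candidates instead of 20 is where the iteration and execution-time savings reported in the abstract come from; the trade-off is that random sampling may miss the global optimum on any single run.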

    Machine Learning Algorithms for Soil Analysis and Crop Production Optimization: A review

    Get PDF

    Using parallel random forest classifier in predicting land suitability for crop production

    Get PDF
    In this paper, we present an optimized Machine Learning (ML) algorithm for predicting land suitability for crop (sorghum) production, given soil properties information. We set up experiments using Parallel Random Forest (PRF), Linear Regression (LR), Linear Discriminant Analysis (LDA), KNN, Gaussian Naïve Bayes (GNB) and Support Vector Machine (SVM). Experiments were evaluated using 10-fold cross-validation. We observed that Parallel Random Forest had the best accuracy, 0.96, with an execution time of 1.7 s. Agriculture is the mainstay of food security, and Kenya relies on agriculture to feed its population. Land evaluation assesses the potential of land use, in this case for crop production. In the Department of Soil Survey at the Kenya Agriculture and Livestock Research Organization (KALRO) and other soil research organizations, land evaluation is done manually; it is laborious, takes a long time, and is prone to human error. The outcomes of this research can save time and improve accuracy in the land evaluation process. They also make it possible to predict land suitability for crop production from soil properties information without the intervention of a soil-science expert. Therefore, agricultural stakeholders will be able to efficiently make informed decisions for optimal crop production and soil management.
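The evaluation protocol described above can be sketched as follows. Synthetic data stands in for the KALRO soil dataset, which is not available here, and scikit-learn's `LogisticRegression` stands in for the paper's "Linear Regression (LR)" classifier; all model settings are assumptions.

```python
# Sketch: comparing several classifiers with 10-fold cross-validation,
# as in the paper's experimental setup. Data and settings are placeholders.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=12, random_state=0)

models = {
    "RF":  RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
    "LR":  LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "GNB": GaussianNB(),
    "SVM": SVC(),
}

# Mean accuracy over 10 folds for each model.
scores = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

`n_jobs=-1` on the forest builds trees on all available cores, which is the simplest readily available stand-in for the paper's parallel RF.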

    Random Forest, Hyperparameter Optimization, GPU Parallelization, and Application to Soil Analysis for Crop Optimization

    No full text
    The research developed in this thesis focused on the evaluation of Machine Learning algorithms through the Random Forest (RF) algorithm. Land evaluation for optimal crop production is nowadays done manually, which makes it slow and prone to human error. Several Machine Learning algorithms (linear regression, linear discriminant analysis, k-nearest neighbours, Gaussian Naïve Bayes, support vector machines) were tested and evaluated on datasets. The RF algorithm made it possible to develop a classifier of soil data, and thus an expert system that does not require the involvement of a human soil-science expert. This approach can improve the land evaluation process and provide agricultural land evaluation services. Two approaches to optimizing RF performance were developed. First, a non-deterministic algorithm was formulated to optimize execution time and accuracy; its results were compared to those of a deterministic exhaustive search. Second, strategies for parallelizing the construction of RF on GPUs were evaluated to reduce the training time of such a classifier. A sequential version, a parallel version, and a dynamic coarse-grained parallel version were studied, proposed, and tested in solutions named seqRFGPU, parRFGPU and dpRFGPU, respectively. The results show that seqRFGPU reduces execution times, with interesting average speedups for parRFGPU and dpRFGPU.
    The development of the RF algorithm has historically led to many libraries implementing it and to its use on a very wide variety of problems and datasets. Most RF implementations are based on the original idea proposed by Léo Breiman in 2001; variations range from implementation platforms to new ideas, such as new data-splitting approaches, introduced to improve performance and accuracy. The solutions for application to soil analysis, hyperparameter optimization, and GPU parallelization are discussed in the full version of this thesis, written in English.
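The thesis's GPU implementations (seqRFGPU, parRFGPU, dpRFGPU) are not publicly reproduced here. As a rough, readily runnable analogue of coarse-grained parallel forest construction, the sketch below times sequential versus multi-core tree building with scikit-learn's `n_jobs`; any speedup observed depends entirely on the host machine and says nothing about the GPU results reported above.

```python
# Sketch: sequential vs. coarse-grained parallel forest construction,
# using CPU workers as a stand-in for the thesis's GPU parallelization.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def fit_time(n_jobs):
    """Wall-clock time to fit a 200-tree forest with the given worker count."""
    clf = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs, random_state=0)
    t0 = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - t0

t_seq = fit_time(1)    # trees built one at a time
t_par = fit_time(-1)   # one worker per core, trees built concurrently
print(f"sequential {t_seq:.2f}s, parallel {t_par:.2f}s")
```

Tree-level parallelism works because RF trees are trained independently on bootstrap samples; the same independence is what makes the algorithm amenable to the GPU strategies the thesis studies.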

    A Novel Tightly Coupled Information System for Research Data Management

    No full text
    Most research projects are data driven. However, many organizations lack proper information systems (IS) for managing data, that is, for planning, collecting, analyzing, storing, archiving, and sharing it for use and re-use. Many research institutions have disparate and fragmented data, which makes it difficult to uphold the FAIR (findable, accessible, interoperable, and reusable) data management principles. At the same time, there is minimal practice of open and reproducible science. To address these challenges, we designed and implemented an IS architecture for research data management, giving us a centralized platform for managing research data. The IS has several software components that are configured and unified to communicate and share data, namely a common ontology, a data management plan, data collectors, and the data warehouse. Results show that the IS components have gained global traction: 56.3% of total web hits came from new users, and 259 projects had metadata (17 of which also had data resources). Moreover, the IS aligned the institution's scientific data resources with universal standards such as the FAIR data management principles, while showcasing open data, open science, and reproducible science. Ultimately, the architecture can be adopted by other organizations to manage research data.